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ABSTRACT 

Gravitational lensing and stellar dynamics are two independent methods, based solely on gravity, 
to study the mass distributions of galaxies. Both methods suffer from degeneracies, however, that are 
difficult to break. In this paper, we present a new framework that self-consistently unifies gravitational 
lensing and stellar dynamics, breaking some classical degeneracies that have limited their individual 
usage, particularly in the study of high-rcdshift galaxies. 

For any given galaxy potential, the mapping of both the unknown lensed source brightness distribu- 
tion and the stellar phase-space distribution function onto the photometric and kinematic observables 
can be cast as a single set of coupled linear equations, which are solved by maximizing the likelihood 
penalty function. The Bayesian evidence penalty function subsequently allows one to find the best 
potential-model parameters and to quantitatively rank potential-model families or other model as- 
sumptions (e.g. PSF). We have implemented a fast algorithm that solves for the maximum- likelihood 
pixelized lensed source brightness distribution and the two-integral stellar phase-space distribution 
function f(E, L z ), assuming axisymmetric potentials. To make the method practical, we have devised 
a new Monte Carlo approach to Schwarzschild's orbital superposition method, based on the superpo- 
sition of two-integral (E and L z ) toroidal components, to find the maximum-likelihood two- integral 
distribution function in a matter of seconds in any axisymmetric potential. The non linear parameters 
of the potential are subsequently found through a hybrid MCMC and Simplex optimization of the 
evidence. Illustrated by the power-law potential models of Evans, we show that the inclusion of stel- 
lar kinematic constraints allows the correct linear and non linear model parameters to be recovered, 
including the potential strength, oblateness and inclination, which, in the case of gravitational-lensing 
constraints only, would otherwise be fully degenerate. 

Subject headings: gravitational lensing — stellar dynamics — galaxies: structure — galaxies: elliptical 
and lenticular, cD 



1. INTRODUCTION 

Understanding the formation and the evolution of 
early-type galaxies is one of the most important open 
problems in present-day astrophysics and cosmology. 
Within the standard ACDM paradigm, massive ellipti- 
cals are predicted to be formed via hierarchical merging 
of lower mass galaxies (lToomrelll977t iFrenk et aLlfl988t 



Whi te fc Frenklil99H [B arne s 1992 nCoie" et al. 2000). De 
spite the many theoretical and observational successes 
of this scenario, several important features of early- 
type galaxies are still left unexplained. In particular, 
the origin of the (often strikingly tight) empirical scal- 
ing laws that correlate the global properties of ellip- 
ticals remain unexp l ained : (i) the fundamenta l plane 
(jDiorgovski fc Davis! 119871 ; iDressler et ail I1987D . relat- 
ing effective radius, velocity dispe rsion and effective sur- 
face brightness; (ii) t he Mrh — a (iMagorrian et al.lll998h 
iFerrarese fc Merritt] l2000t iGebhardt et al.ll2000fl . relat- 
ing the mass of the central supermassive black hole 
hosted b y the galaxy with its velocity dispersion; (iii) the 
color-cr feower Lucev. fc Ellid I1992D and the Mgp - a 
(iGuzman et all Il992t Bender. Burstein. fc Faberl Il993t 
iBernardi et al.l2003h relating the velocity dispersion with 
the stellar ages and the chemical composition of the 
galaxy. 

Each of these scaling relations relate structural, 
(spectro-) photometric, and dynamical (i.e. stellar ve- 
locity dispersion) quantities. Whereas the first two are 



solely based on the observed stellar component, the latter 
is a function of the stellar and dark matter mass distri- 
bution. Hence, a detailed study of the inner mass profile 
of early-type galaxies at different redshifts is undoubt- 
edly necessary to properly address the numerous issues 
related with the formation and the evolution of these ob- 
jects and their scaling relations, and it would also consti- 
tute an excellent test bed for the validity of the ACDM 
scenario on small (i.e. non linear) scales. 

Today, a large number of thorough stellar dy- 
namic and X-ray studies have been conducted to 
probe the mass structure of nea r by (z 5, 0-1) early- 
type galaxies dFabbianol 119891 : iMould et al.l 
Saglia. Bcrtin. & Stia vellil fl992t iBertin et alT 



Frank, van Gorkom. fc de Zeeuwl 119941; ICarollo 



1995t lArnaboldi et all 119961: iRix et al.l 



Matsush ita et al. 19981 ; ILoewenst ein & White 

^ — h — — 



Gerhard et al.l 12001b iSeliakl 200a de Zeeuw et al.l 12002. 
Borriello. Salucci. fc Danesd 120031: l Romanowsk v_et al.l 
2006t iGavazzi et alj 



2003; ICappellari et al 



2007), 



most (but not all) cases finding evidence for the pres- 
ence of a dark matter halo component and for a flat 
equivalent rotation curve in the inner regions. 

When it comes to distant (z > 0.1) early-type 
galaxies, however, only relatively little is known. 
The two main diagnostic tools which can be em- 
ployed t o carry out such studie s, namely gravitational 
lensing (| Schneider et al. I [2006) and stellar dynamics 
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(|Binnev fc Tremaind Il987f ). both suffer from limita- 
tions and degeneracies. Gravitational lensing provides 
an accurate and almost model independent determi- 
nation of the total mass projected within the Ein- 
stein radius, but a reliable determination of the mass 
density profile is often prevented by the effect s of 
the mass sheet ([Falco. Gorenstein. fc Shapiro! 1 1 98*51 ) and 
the related ma s s profile and shape degeneracies (e.g. 
Wucknitd [2001 lEvans fe Wittl l200l ISaha fe Wi lliams 



20061) . Despite these limitations, strong gravitational 
lensing has provided the first, although heterogeneous, 
constraints on the internal mass distributio n of stellar 
and dark matter in high-r e dshift galaxies (e.g.lKochanekl 
19951: IRusin fe Mai 120011: [Mai 120031; IRusin et all | 2002|: 
Cohn et all 120011: iMunoz et al.lT2001; Winn et al.l 200S 



Wucknjtz et al. 2004; 
20051 iDve fe Warrer 
Dobke fe K ing 2006). 



Fcrr eras et al.l 120051: 
120051 : iBrewer fe Lewisl T2006t 



Analyses based on stellar dynamics are limited by the 
paucity of bright kinematic tracers in the outer regions 
of early-type galaxies (e.g. I Gerh ard 2006). Moreover, 
there is a severe degeneracy (the mass-anisotropy degen- 
eracy) between the mass profile of the galaxy and the 
anisotropy of its stellar velocity dispersion tensor (e.g. 
lGerhardf ll993). Higher order velocity moments, which 
could be used to overcome this difficulty, cannot be mea- 
sured in reasonable integration time for such distant sys- 
tems with any of the current instruments. 

A viable solution to break these degeneracies to a 
large extent is a joint analysis which combines con- 
straints fr om both gravitational l ensing and stellar 
dyna mics (IKoopmans fc Treul 120021: ITreu fc Koopmansl 
2002L 120031: IKoopmans fc TreuN 2003: Ru sin et all ~ 



20M 



Treu fc Koopmansl 120041: IRusin fe Kochaneld 120051: 
Bolton et all 120061: ITreu et all 120061 : IKoopmans et all 
20061 : iGavazzi et al.1 l2007| ): the two approaches can 
be described as almost "orthogonal", in other words 
they comp l ement each other very effectively (see e.g. 
IKoopmans! 12004 for some simple scaling relations), 
allowing a trustworthy determination of the mass 
density profile of the lens galaxy in the inner regions 
(i.e. within the Einstein radius or the effective radius, 
whichever is larger). A joint gravitational lensing and 
stellar dynamics study of the Lenses Structure and Dy- 
namics (LSD) and Sloan Lens ACS (SLACS) samples of 
early - type galaxies (IKoopmans et ail 120061 : iBolton et al.l 
120061: ITreu et all |2006j), for example, showed that the 
total mass density has an average logarithmic slope 
extremely close to isothermal ((s') = 2.01 ± 0.03, 
assuming ptot oc r~ s ) with a very small intrinsic spread 
of at most 6%. They also find no evidence for any 
secular evolution of the inner density sl ope of early-type 
galaxies between z = 0.08 and 1.01 (|Koopmans et all 



gal; 
200 



These authors, in their combined analysis, make use 
of both gravitational lensing and stellar dynamics in- 
formation, but treat the two problems as independent. 
More specifically, the projected mass distribution of 
the lens galaxy is modeled as a singular isothermal el- 
lipsoid (SIE; e.g. iKormann et alj [1994), and the ob- 
tained value for the total mass within the Einstein ra- 
dius is used as a constraint for the dynamical model- 
ing; in the latter the galaxy is assumed to be spherical 



and the stellar orbit anisotropy to be constant (or fol- 
low an Osipkov-Merritt profile; lOsipkovl 11979b iMerrittl 
fl985airb1 ) in the inner regions and therefore the spherical 
Jeans equat ions can be solved in order to determine the 
slope s' (see Koopmans & Treu 2003; Trcu & Koopmans 
120041 : IKoopmans et al.l 2006. for a full discussion of the 
methodology). This approach, however, while being ro- 
bust and successful, is not fully self-consistent: gravi- 
tational lensing and the stellar dynamics use different 
assumptions about the symmetry of the system and dif- 
ferent potentials for the lens galaxy. 

This has motivated us to develop a completely rig- 
orous and self-consistent framework to carry out com- 
bined gravitational lensing and stellar dynamics stud- 
ies of early-type galaxies, the topic of the present pa- 
per. The methodology that we introduce, in principle, is 
completely general and allows - given a set of data from 
gravitational lensing (i.e. the surface brightness distribu- 
tion of the lensed images) and stellar dynamics (i.e. the 
surface brightness distribution and the line-of-sight pro- 
jected velocity moments of the lens galaxy) - the "best" 
parameters, describing the gravitational potential of the 
galaxy and its stellar phase-space distribution function, 
to be recovered. 

In practice, because of technical and computational 
limitations, we restrict ourselves to axisymmetric po- 
tentials and two-integral stel lar phase-space distributio n 
functions (i.e. f(E,L z ); see iBinnev fc Tremaind [l987l) . 
in order to present a fast and efficiently working algo- 
rithm. We introduce a new Monte Carlo approach to 
Schwarzschild's orbital superposition method, that al- 
lows the f(E, L z ) to be solved in axisymmetric potentials 
in a matter of seconds. All of these restrictions, however, 
should be seen just as one particular implementation of 
the much broader general framework. 

The very core of the methodology lies in the possibil- 
ity of formulating both the lensed image reconstruction 
and the dynamical modeling as formally analogous linear 
problems. The whole framework is coherently integrated 
in the context of Bayesian statistics, which provides an 
objective and unambiguous criterion for model compari- 
son, based on the evidence merit function (MacKav 1992; 
!Mackavll2003l ). The Bayesian evidence penalty function 
allows one to both determine the "best" (i.e. the most 
plausible in an Occam's razor sense) set of non linear pa- 
rameters for an assigned potential and compare and rank 
different potential families, as well as set the optimal reg- 
ularization level in solving the linear equations and find 
the best point-spread function (PSF) model, pixel scales, 
etc. 

The paper is organized as follows: In Sect. [2] we first 
review some relevant aspects of the theory of Bayesian 
inference, and we elucidate how these apply to the frame- 
work of combined lensing and stellar dynamics. In Sect. [3] 
we present a general overview of the framework, with 
particular focus on the case of our implementation for 
axisymmetric potentials. In Sects. [4] and [5l respec- 
tively, we provide a detailed description of the meth- 
ods for the lensed image reconstruction and the dy- 
namical modeling. In Sect. [6] we describe the testing 
of the method, showing that it allows a reliable re- 
covery of the correct model potential parameters. Fi- 
nally, conclusions are drawn in Sect. [3 The application 
of the algorithm to the early-type lens galaxies of the 
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SLACS sample (e g.lBolton et alj|2006t iTreu et all . 
iKoopmans et al.1 120061 : iGavazzi et al will be pre- 

sented in forthcoming papers. 

2. DATA FITTING AND MODEL COMPARISON: 
A BAYESIAN APPROACH 

Before describing the code itself in more detail, in this 
Section we first review the basics of Bayesian statistics. 
This is relevant, because - as will be shown in more detail 
in the coming Sections - both the lensing and stellar 
dynamics parts of the algorithm can be formalized as 
linear sets of equations 

b = Ax + n (1) 
that can be solved within a Bayesian statistical frame- 
work. In the above equation, b represents the data, 
A = A[<S>(ff)] is the model, which will in general depend 
on the physics involved (e.g., in our case, the lens-galaxy 
potential <!>, a function of the non linear parameters if), 
x are the linear parameters that we want to infer, and ft 
is the noise in the data, characterized by the covariance 
matrix C. 

First, we aim to determine the linear parameters x 
given the data and a fixed model and, on a more general 
level, to find the non linear parameters ff corresponding 
to the "best" model. Note that the choices of grid size, 
PSF, etc., are also regarded as being (discrete) param- 
eters of the model family (changing these quantities or 
assumptions formally is equivalent to adopting a different 
family of models). 

Both of these problems can quantitatively be addressed 
within the framework of Bayesian statistics. Following 
the approach of MacKay (lMacKavill992l Il999t iMackavl 
2003; see also lSuvu et alJl2006t l. it is possible to distin- 
guish between three different levels of inference for the 
data modeling. 

1. At the first level of inference, the model A is as- 
sumed to be true, and the most probable imp for 
the linear parameters is obtained by maximizing 
the posterior probability, given by the expression 
(derived from Bayes' rule) 



p(x\b,x,A,n) 



P{b\x, A)P(f|H, A) 



(2) 



P(6|A,A,H) 

where H is a regularization operator, which for- 
malizes a conscious a priori assumption about the 
degree of smoothness that we expect to find in the 
solution (refer to Appendix [X] for the construc- 
tion of the regularization matrix); the level of reg- 
ularization is set by the hyperparameter value A. 
The introduction of some kind of prior (the term 
P(x|H, A) in Eq. [2]) is inescapable, since, due to 
the presence of noise in the data, the problem from 
equation ([T]) is an ill-conditioned linear system and, 
therefore, cannot simply be solved through a direct 

inv ersion along the li ne of x = A _1 6 (see for exam- 
ple iPress et al.lll992l for a clear introductory treat- 
ment on this subject). Note that besides H, any 
a priori assumption (PSF, pixel scales, etc.) can 
be treated and ranked through the evidence. Tra- 
ditional likelihood methods do not allow a proper 
quantitative ranking of model families or other as- 
sumptions. In Eq. ((2|), the probability P(b\x,A) is 



the likelihood term, while the normalization con- 
stant P(b\\, A, H) is called the evidence and plays 
a fundamental role in Bayesian analysis, because it 
represents the likelihood term at the higher levels 
of inference. 

2. At the second level, we infer the most probable hy- 
perparameter Amp for the model A by maximizing 
the posterior probability function 



P(A|6,A,H) 



P(6|A,A,H)P(A) 



(3) 



P(6|A,H) 

which is equivalent to maximizing the likelihood 
P(b\ A, A, H) (i.e. the evidence of the previous level) 
if the prior P(A) is taken to be flat in logarithm, 
as is customarily done, 1 , because its scale is not 
known. 

3. Finally, at the third level of inference the models 
are objectively compared and ranked on the basis 
of the evidence (of the previous level) , 



P(A,H|6) cx P(fe|A,H)P(A,H), 



(4) 



where the prior P(A, H) is ass u med to be flat. 
It has been shown by iMacKavl (|1992f ) that this 
evidence-based Bayesian method for model com- 
parison automatically embodies the principle of 
Occam's razor, i.e. it penalizes those models which 
correctly predict the data, but are unnecessarily 
complex. Hence, it is in some sense analogous to 
the reduced x 2 - 



In Sections 12.11 and 12.21 we illustrate how this general 
framework is implemented under reasonable simplifying 
assumptions and how the maximization of the evidence 
is done in practice. 

2.1. Maximizing the posterior: Linear optimization 

If the noise n in the data can be modeled as Gaussian , 
it is possible to show (jMacKavi fl99l ISuvu et al.ll2006f ) 
that the posterior probability (Eq. [5] ) can be written as 

exp[-E v {x)] 



P(x\b, A, A, H) 



J exp [— E-pix)] dx" 



(5) 



where 



E v =E c {x) + \E n {x). (6) 
In the penalty function from equation ([5]), 

E c (x) = i(A.f - bfc-^Ax - b) (7) 

(i.e. half of the % 2 value) is a term proportional to the 
logarithm of the likelihood, which quantifies the agree- 
ment of the model with the data, and the term 



E n {x)= l -\\Kx\\ 2 



(8) 



is the regularizing function, which is taken to have a 
quadratic form with the minimum in x rcg = (see the 

1 With the statistics terminology this is an "uninformative" 
prior, which tries to repr esent the absen ce of a priori information 
on the scale of A (see e.g. Cousins 1995 and references therein). 
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seminal paper of lTikhonovll963l for the use of the regular- 
ization method in solving ill-posed problems). The term 
En formalizes the only a priori assumption regarding the 
smoothness of the solution, and therefore corresponds to 
the prior term of Eq. |(5J). 

The most probable solution xmp is obtained by maxi- 
mizing the posterior from equation ([5]). The calculation 

of d [Ec(x) + XE-ji(x)} jdx = yields the linear set of 
normal equations 

(A T C" 1 A + AH T H)x = (9) 

which maximizes the posterior (a solution exists and is 
unique because of the quadratic and positive definite na- 
ture of the matrices) . 

If the solution imp is unconstrained, the set of equa- 
tions ^ can be effectively and non-iteratively solved 
using e.g. a Cholesky decomposition technique. How- 
ever, in our case the solutions have a very direct 
physical interpretation, representing the surface bright- 
ness distribution of the source in the case of lens- 
ing (e.g. Section [4|), or the weighted distribution func- 
tion in the case of dynamics (e.g. Section [5]). The 
solutions must therefore be non-negative. Hence, we 
compute the solution of the constrained set of equa- 
tions J9]) using the freely-available L-BFGS-B code, a lim- 
ited memory and bound constrained implementation of 
the Broyden-Fletcher-Goldfarb-Sh anno (BFGS) method 
for solving optimization problems (jBvrd. Lu. fc Nocedall 
[19951 iZhu. Bvrd. fc NoceTIal[l99l . 

2.2. Maximizing the evidence: non linear optimization 

The implementation of the higher levels of inference 
(i.e. the model comparison) requires an iterative opti- 
mization process. At every iteration i, a different set 
fji of the non linear parameters is considered, generating 
a new model A,; = A[$(rfi)] for which the most proba- 
ble solution XMP.i for the linear parameters is recovered 
as described in Sect. 12. li and the associated evidence 
£(rfi,X) = P(6|A, Ai,H) is calculated (the explicit ex- 
pression for the evidence is straightforward to compute 
but rather cumbersome and is, therefore, given in Ap- 
pendix |E|. Then a nested loop, corresponding to the 
second level of inference, is carried out to determine the 
most probable hyperparameter Amp for this model, by 
maximization of £{fji, A). The evidence £i = P{b\ A^, H) 
for the model A^ can now be calculated by marginaliz- 
ing over the hyperparameters. 2 The different models A^ 
can now be ranked according to the respective value of 
the evidence £i (this is the third level of inference), and 
the best model is the one which maximizes the evidence. 
This procedure is in fact very general, and can be applied 
to compare models with different types of potentials, reg- 
ularization matrices, PSFs, grid sizes, etc. 

2.2.1. The Hyperparameters 

In principle, for each ffi one has to determine the cor- 
responding set of most probable hyperparameters Amp,; 
by means of a nested optimization loop. In practice, 

2 However, it turns out that a reasonable approximation is to 
calculate Ei assuming that P(X\b, A^, H) is a S function centered on 
Am p, so that the evidence is obtaine d simply as P(ft|Ajyip, A;, H) 
(sec IMacKavlflgijaiSuvu et~aTll20M) . 
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however, the values of the hyperparameters change only 
slightly when ff is varied in the optimization process, and 
therefore it is not necessary to perform a nested loop for 
A at each ^-iteration. What we do is to start the evidence 
maximization by iteratively changing ff, while the hyper- 
parameters are kept fixed at a quite large initial value Ao 
(so that the solutions are assured to be smooth and the 
preliminary exploration of the ff space is faster). This 
is followed by a second loop in which the best 77-model 
found so far is fixed, and we optimize only for A. The al- 
ternate loop procedure is iterated until the maximum for 
the evidence is reached. Our tests show that the hyperpa- 
rameters generally remain very close to the values found 
at the second loop. Hence, this approximation (which is 
somewhat equivalent to a line-search minimization along 
- alternatively - the ^-parameters and the hyperparame- 
ters) works satisfactorily, significantly reducing the num- 
ber of iterations necessary to reach convergence. 

2.2.2. MCMC and Downhill- Simplex Optimization 

The maximization of the evidence is, in general, not 
an easy task, because obtaining the function value for a 
given model can be very time consuming and the gra- 
dient of the evidence over the non linear parameters ff 
is in general not known or too expensive to calculate. 
We have therefore tailored a hybrid optimization rou- 
tine which combines a (modified) Markov Chain Monte 
Carlo ( MCMC) search wi th a Downhill-Simplex method 
(see e.g. iPress et aLlll992L for a description of both tech- 
niques). The preliminary run of the simplex method (or- 
ganized in the loops described above) is the crucial step, 
because even when launched from a very skewed and un- 
realistic starting point ffo, it usually allows one to reach a 
good local maximum and from there to recover the "best 
set" of parameters, i.e. the 77-set which corresponds to 
the absolute maximum of the evidence function. 

The outcome of this first-order optimization is then 
used as the starting point for a modified MCMC explo- 
ration of the surrounding evidence surface with an ac- 
ceptance ratio based on the evaluation of the normalized 
quantity (£ k - £bmax)/£bmax (where £ k is the evidence 
of the considered point and 4max is the evidence of the 
best maximum found so far; higher maxima are always 
accepted). When a point is "accepted", a fast Simplex 
routine is launched from that point to determine the evi- 
dence value of the local maximum to which it belongs. If 
this turns out to be a better (i.e. higher) maximum than 
the best value found so far, it becomes the new starting 
point for the MCMC. In practice, the whole procedure 
can be sped up by about 1 order of magnitude if some 
phases of the MCMC exploration and the subsequent lo- 
cal peak climbing arc limited to the optimization of the 
evidence of lensing only (whose evaluation is much faster 
than the calculation of the total evidence), while the ev- 
idence of dynamics provides the decisive criterion to dis- 
criminate between the obtained set of local maxima of 
the lensing evidence surface. 

As for the implementation of the Downhill-Simplex 
method, we have made use of both th e freely avail- 
able optimization packages MINUIT (|Jamesl I1994D 
and APPSP ACK, a parallel derivative-free op timization 
package fsee lGrav fe Kolda 2005; Kolda 200j| for an ex- 
haustive description of the algorithm), receiving similar 
results. The effectiveness of this hybrid approach for our 
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Fig. 1. — Scheme of the general framework for joint gravitational lensing and stellar dynamics analysis. See the text for an extended 
description. 



class of problems will be demonstrated in Sect. 16.31 where 
a test case is considered. 

3. THE METHOD 

In this Section we describe the general outline of the 
framework for joint gravitational lensing and stellar dy- 
namics analysis (see Fig. Q] for a schematic flow chart 
of the method). We develop an implementation of this 
methodology (the CAULDRON 3 algorithm) which applies 
specifically to axisymmetric potentials and two-integral 
stellar distribution functions. This restriction is a good 
compromise between extremely fast lensing plus dynam- 
ical modeling and fl exibility. Fully tria xial and three- 
integral methods fe.g. lCretton et alJ T999), although pos- 
sible within our framework, are not yet justified by the 
data quality and would make the algorithms much slower. 
Of course, the true potential might be neither axisym- 
metric nor even triaxial. Hence, if significative depar- 
tures from the assumption of axisymmetry are present 
in the lens galaxy, it will have an effect on the lensing 
and stellar kinematic reconstructions, which cannot be 
accounted for by these simple models. In that case, the 
model needs to be revised correspondingly. However, if 
the model fits the data, in a Bayesian sense, such that no 
signicant residuals are left, then a more complex model is 
not yet warranted by the data. We emphasize that this 
does not prove that the galaxy is not triaxial, has three 
integrals of motion, or is even more complex, but only 
that the data cann ot make a statement about it (see e.g. 
iLiddle et all 120071 for an illustrative discussion on this 
topic). 

The technical details and a more exhaustive explana- 
tion of the terminology are given in the following Sections 
and in the Appendices. 

First, consider a set of observational data for an ellip- 
tical lens galaxy consisting of (1) the surface brightness 
distribution of the lensed images (which we will refer to 
as the lensing data d) and (2) the surface brightness dis- 

3 Combined Algorithm for Unified Lensing and Dynamics 
RcconstructiON . 



tribution and the first and second velocity moments map 
of the lens galaxy itself (hereafter the kinematic data p) . 
It is also assumed that the redshift of the source (z s ) and 
of the lens (za) are known. 

Second, we choose a family of gravitational potentials 
3>(x, ff) which we expect to provide a reasonably good 
description of the true potential of elliptical galaxies for 
some set of model parameters ff, such as the normaliza- 
tion constant, core radius, oblateness, slope, etc. 4 The 
vector ff can also include other "non intrinsic" quantities, 
such as the inclination angle, the position angle and the 
coordinates of the center of the elliptical galaxy. 

In order to understand the physical characteristics of 
the lens galaxy, the first problem consists of finding 
the specific values of parameters ff that optimize some 
penalty function based on the mismatch between the 
data and the model, within the solution space allowed by 
the chosen potential model. A more general and difficult, 
but also much more interesting problem, is an objective 
and quantitative ranking of a set of different families of 
potentials {<t(x, ff)}. Our method is designed to address 
both of these questions; given the data and a choice for 
the potential, the algorithm yields the most likely solu- 
tion of linear parameters and a value for Bayesian evi- 
dence £ (ff). The maximum evidence solution £ ma x(?7max) 
allows direct and objective model family comparison. 
The model family can also include different choices of 
pixel scales, regularization, PSF models, etc. 

A comparison based on the evidence, as was described 
in Section[2j overcomes many of the traditional hurdles in 
reg ularization and objective model c omparison (see also 
e.g. ISuvu etaHl2006t lMarshal]||2006l for a thorough dis- 
cussion of the Bayesian framework in the context of non 
parametric lensing) and is, therefore, extremely power- 
ful. 

3.1. The cauldron algorithm 

4 We note that in principle if could even be the potential values 
on a grid of (R, z) and, hence, be grid-based itself. 
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Whereas the lensed-image reconstruction is general 
and can straightforwardly be applied to an arbitrary po- 
tential, in the case of the dynamical modeling we de- 
scribe how it can be coherently integrated into the gen- 
eral framework, but then focus on the implementation 
for axisymmetric potentials and two-integral distribution 
functions. The algorithm requires a set of observables d 
and p, and the choice of a parametric model &(x, ff). An 
initial set ffi (with i = 0) for the non linear parame- 
ters is selected. We indicate the corresponding potential 
as $i = $>(x,7fi). This potential is used for both the 
lensed-image reconstruction and the dynamical model- 
ing, so that the method is completely self-consistent and 
makes full use of all the data constraints. 

3.1.1. Gravitational Lens Modeling 

The basic idea for the lensed-image reconstruction is 
the following. (1) One describes the source brightness 
distribution by a set of (possibly irregularly spaced) pix- 
els in the source plane s with each pixel value repre- 
senting the source surface brightness at the location of 
the pixel. (2) A matrix operator L is then constructed 
for a given lens potential, which multiplied by a given s 
(and blurred b y the PSF) represents the observed lensed 
image (see e.g. IWarren fe Dvell2003t pTreu fe Koopmans) 
2001 lKoopmansll2005t iDve fc W arren 2005; Suvu et all 
20061 : IWavth fc Webster 2006; B rewer fc L ewis 2006ah. 

In practice, the three-dimensional potential $i is inte- 
grated along the line of sight z' to obtain the projected 
potential tpi. The deflection angle is then determined 
from ipi . The particular grid-based source recons truc- 
tion metho d intro duced in lTreu fc Koopmansl (|2004r ) and 
iKoopmansI (|2005l ) then allows one to quickly construct 
the lensing operator = L(r/i) and to determine the 
most probable pixelized source s*MP,i by maximizing the 
posterior probability (Section 12. lj) . More discussion of 
the lensing routine is presented in Section 21 

3.1.2. Stellar Dynamical Modeling 

To construct the axisymmetric dynamical model (and 
the corresponding operator Q, see Sect. 13.1^ that repro- 
duces the kinematic data set, a Schwarzschild method 
(Schwarzschild 1979) is used. Within this flexible and 
powerful framework, a library of stellar orbits is inte- 
grated in an arbitrary potential $ {x,ff) and the specific 
weighted superposition of orbits is determined which best 
reproduces the observed surface brightness distribution 
and line-of-sight velo city moments of the galaxy (e.g. 
lRichstondfl980lH98l . 

In the case of the CAULDRON algorithm for axisym- 
metric potentials &(R,z,ff), we use the two-integra l 
Sch warzschild method deve l oped by lCretton et al.l (|1999h 
and IVerolme fc de Zeeuwl (|2002f ). In contrast to the 
"classical" Schwarzschild method, here the building 
blocks for the galaxy are constituted not by orbits, but by 
"two-integral components" (TICs) , derived from Dirac 5 
two-integral distribution functions (i.e. as functions of 
the orbital energy E and the axial component of the 
angular momentum L z ). A TIC can be thought of as 
an elementary building block of toroidal shape, charac- 
terized by a combination of orbits that produces a 1 / R 
radial density distribution and very simple (analytically 
calculable) velocity moments. For any TIC the projected 
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observables (e.g. in our case surface brightness and line- 
of-sight first and second velocity moments) can then be 
straightforwardly calculated given <I>i. The projected ax- 
isymmetric density distribution and velocity moments 
can be obtained as a weig hted superposition of TICs 
(jVerolme fc de Zeeuwl[2002l ). 

The aim of the dynamical modeling is therefore to re- 
cover a set of weights which describe how the stellar or- 
bits (or, in the case of the axisymmetric implementation, 
the TIC observables) are linearly superposed to match 
the data. In analogy to the lensing case, this is done by 
maximizing the posterior probability (Sect. 12. lj) . For a 
more extended description of the dynamics routine and 
the generation of the TICs, we refer to Section [5] and 
Appendix [C] 

3.1.3. Linear optimization of the posterior probability 

A consequence of the previous considerations and an 
important feature of the algorithm is that both the gravi- 
tational lensing source reconstruction and the dynamical 
modeling can be expressed in a formally analogous way 
as sets of coupled linear equations of the form 

Ls = d (lensing) 

(10) 

Q7 = p (dynamics). 

The "lensing operator" L encodes how the source sur- 
face brightness distribution s is mapped on to the ob- 
served image d. Each of the N s columns of L describe 
how a point source localized on the corresponding pixel is 
lensed and possibly multiple imaged on the image plane 
grid. Similarly, the "dynamical operator" Q contains 
along its N 7 columns all the information about the ob- 
servables generated by each individual orbit or TIC (i.e. 
the surface brightness and the weighted line-of-sight ve- 
locity moments, written in pixelized form on the same 
grid as the data), which are superposed with weights 7 
to generate the data set p. 

The crucial advantage of this formulation lies in the 
fact that each element of Eq. (fT0|) is a linear system of 
equations which can be solved in a fast and non iterative 
way, using the standard linear optimization techniques. 
Because both L and Q arc built from the same lens po- 
tential $(77), both sets of equations are coupled through 
the non linear parameters ff. 

Because of the ill-posed nature of the set of equa- 
tions (fT0|) (i.e. the data are noisy), as was discussed in 
Section [2. 1[ finding the solution for s and p that maxim- 
imize the posterior probability translates into solving a 
set of (regularized) linear equations 

{ (11) 
[ (Q T C D 1 Q + A D HTH D )7 = Q T C D V 

where C L and C D are the covariance matrices for lensing 
and dynamics, respectively, H L and H D are a choice for 
the regularization matrix, and Al and Ad are the corre- 
sponding regularization hyperparameters. Note that, for 
A = (i.e. in absence of regularization), the solution of 
Eqs. (jllj) is equivalent to the maximum likelihood solu- 
tion in the case of Gaussian errors. 

Once L and Q have been constructed, from Eqs. pip 
we can derive the solutions 4ip,i and 7MP.i, relative to 
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the choice of the non linear parameters rfi and of the hy- 
perparameters = (Al,Ad)i- This, however, represents 
just a single possible model, namely, <&i = &(rji), belong- 
ing to one family of potentials $(ff) and, in general, will 

not be the "best" model given the data d and p. 

3.1.4. Non linear optimization of the Bayesian evidence 

In the framework of the Bayesian statistics, the "plau- 
sibility" for each considered model can be objectively 
quantified through the evidence (see Section [2]), a merit 
function which includes (but is not limited to) a likeli- 
hood penalty function, but can also take into account the 
effects of the choice of regularization, grid, PSF, etc. The 
set of non linear parameters ?7bcst, and the corresponding 
best model $bcst = "^(^/bcst), is obtained by maximizing 
the evidence through an iterative optimization loop. For 
each cycle i, a different set of non linear parameters ffi is 
chosen, and the most probable solutions %p,i and 7MP,i 
are found as described before. The evidence £i which 
characterizes this model is then calculated to allow an 
objective comparison between different models. 

Once the evidence is maximized, we are left with the 
best model $bcst (characterized by the value £ max for 
the evidence) and the best reconstruction for the source 
brightness distribution s*MP,bcst and the TIC weights 
7MP,best (which is one-to-one related to the distribution 
function). At this point, the algorithm has provided us 
with the best set of non linear parameters ?7bcst, he. the 
unknowns that we are most interested in. Nevertheless, 
we might wish to tackle the more general problem of 
model family comparison, by considering a different fam- 
ily of potentials $ and applying again the full algorithm 
to it. The result will be a vector ff hcst and a value for 
the evidence £ m ax that we can directly compare to the 
value £ max that we had previously found, determining in 
this way whether the potential <& or <t represents a better 
model given the constraints. 

We now proceed to describe the lensed-image recon- 
struction and the dynamical modeling routines in much 
greater detail. The reader less interested in the technical 
details and more in the application of the algorithm, can 
continue with Sect. [6] without loss of continuity. 

4. GRAVITATIONAL LENSING 

Gravitational lensing can be formulated as the recon- 
struction of an unknown source brightness distribution s 
(pixelized on a grid composed of N s pixels) given the ob- 
served and PSF- convoluted image brightness distribution 
d (sampled on a grid of dimension N&). To tackle this 
problem we have made use of the implementation of the 
method of no n-parametric source re construction initially 
developed by[\Varrcn & Dye (2003) a nd fur t her refined 
and /o r a dapted bvlTreu fc Koopmansl (120041) ;lKoopmansl 
(l2005lh iDve fc Warrenl (120051): ISuvu et al.l (1 2006) ; 
IWavth fc Websterl (|2006l ); iBrewer fc Lewis! (j2006aF ~ 

Each pixel i (with 1 < i < N^) on the image grid, 
located at position x, is cast back to the source plane (at 
position y) through the lensing equation 



Since gravitational lensing conserves the surface bright- 
ness E, the equivalence E(j/) = £(;?) holds (if the ef- 
fect of the PSF is neglected). In general, however, y 
will not exactly correspond to the position of a pixel of 
the fixed source grid. Therefore X(y) is expressed as a 
weighted linear superposition of the surface brightness 
values at the four pixels j% . . . (where the index j runs 
in the interval 1 . . . N s ) w hich delimit the position y (see 
iTreu fc Koopmandl2004j ). The weights Wi . . . Wi for each 
of the source pixels are the bilinear interpolation weights 
(whose sum adds to unity to conserve flux) , and they are 
stored as the elements L i j 1 . . . Lij 4 of a (very) sparse ma- 
trix L of dimension N<i x N B , which represents the lensing 
operator. If the image pixel i is cast outside the borders 
of the source grid, all the elements of the i-th row of L are 
put to zero. In case we need to be more accurate, we can 
split each image grid pixel into n = n\ x n-i subpixels and 
apply the same procedure as before to construct an n- 
f actor ovcrsamplcd lensing matrix of dimension nN^ x N s . 

The lensing operator is therefore a non linear function 
of the parameter vector fj through the potential, i.e. L = 
L($(t/)), and must be constructed at each iteration of the 
CAULDRON algorithm. From L we then construct the 
blurred lensing matrix M = BL, where B is a blurring 
operator (this is a square matrix of order equal to the 
number of rows of L) which accounts for the effects of 
the PSF 5 . If we are dealing with an oversampled lensing 
matrix, we need to include also a resampling operator 
R (of dimension x nN^) that sums the oversampled 
pixels together so that in the end the blurred lensing 
matrix is defined as M = RBL. 

Within this framework, the mapping of the source into 
the lensed image can be expressed as the set of linear 
equations (cfr. Eq. [To] ) 



Ms 



y 



d{x) 



(12) 



where a is the deflection angle, calculated from the 
gravitational potential $ as described in Appendix [Bj 



(13) 

As discussed in Section |2~T1 for Gaussian errors the solu- 
tion s*mp of the ill-conditioned linear system from equa- 
tion (|13|) is found by minimizing the quadratic penalty 
function 

V len [s, Hv)] - ^ (Ms - d) T CZ X (Ms- d) + ^\\Us\\ 2 

(cf. Eq. [6]) by varying s and finding dVi C n/ds = 0. Here 
Cl is the (diagonal) lensing covariance matrix, H is the 
lensing regularization matrix, and Ai en is the correspond- 
ing regularization hyperparameter. This problem trans- 
lates (see again Sect. 12. ip into solving the set of linear 
equations 

(M t Cl 1 M + Ai cn H T H)s = M T C^d (15) 
(cf. Eq. 0). 

Although in Eqs. (11411 and (fT5|) we have indicated the 
regularization matrix simply as H for the sake of clar- 
ity, it should be noted that, since s represents a two- 
dimensional grid, it is necessary in practice to consider 
regularization both in the x- and y-directions, as de- 
scribed, respectively, by matrices and H. y . Therefore, 
the regularization term in Eq. (fT5|) becomes Ai cn (HjH. E + 

5 In more detail, the i-th row of the matrix B contains a dis- 
cretized description of how the PSF is going to spread the surface 
brightness at the i-th pixel of the image d over the surrounding 
grid pixels (see discussion in Trcu & Koopmans 2004). 
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Fig. 2. — The left columns presents the simulated source s s i m and the corresponding lensed image ci s i m with Gaussian noise added. The 
middle column shows the reconstructed quantities obtained with the lensing operator L and a particular choice for the regularization. The 
residuals are shown in the right column. 



h 2 H.yH y )s, where h = Ax/ Ay is the ratio between the 
horizontal and vertical pi xel scales in the c ase of "curva- 
ture regularization" (see ISuvu et all I2006L for a discus- 
sion of different forms of regularization). For a specific 
description of how the actual regularization matrices are 
constructed, we refer to Appendix lAl 

5. DYNAMICS 

In this Section we describe the details of the fast two- 
integral Schwarzschild method for the dynamical mod- 
eling of axisymmetric systems, which is implemented in 
the cauldron algorithm. It should be noted, however, 
that Eqs. (fT6| - (|18p are also valid in the general case of 
arbitrary potentials, provided that 7 is interpreted as the 
vector of the weights of the different (numerically inte- 
grated) stellar orbits of some orbit library. However, the 
actual construction of the "dynamical operator" Q, as 
described in Sect. 15. i] is specific to this particular imple- 
mentation. 

As already shown in Sect. 13.11 from a formal point of 
view the problem of dynamical modeling is identical to 
that of lensing (Sect. HJ, essentially consisting of finding 
the most probable solution tmp for the ill-constrained 
linear system 

Ql = P (16) 

(cf. Eqs. [13] and [TO]). Here Q is the dynamical oper- 
ator which is applied to the vector 7 of the weights of 



the building- block S distribution functions f(E, L z ) (the 
TICs) to generate the set of observables p. See Sect. 15.11 
for a more in-depth description of the meaning of the 
mentioned quantities and the construction of the matrix 
Q. As described in Section l!Q] one derives the solution 
7mp by minimizing the quadratic penalty function 

7W7, *{f?)] =\(Q1- P) T C D 1 (Q7 -P) + 

i(A d / n ||K L 7|| 2 + A d / n ||K B 7l| 2 )(17) 
which corresponds to solving the set of linear equations 

(qTc^q + a^kTk, J- ^ d ^Ti 



Q T C^V 

(18) 



(note the equivalence with Eqs. p^|) and (fT5|) for lens- 
ing). Here we have indicated the (diagonal) dynamics 
covariance matrix as Cq and the regularization matri- 
ces along the E- and L z -axes of the 7-grid as K £ and 
K e , respectively. The regularization along these two di- 
rections are (assumed to be) uncorrelated, and the corre- 



sponding hyperparameters A £ yn and A^ yn must therefore 
be considered independently. 

With respect to their physical meaning, however, 
the construction of the linear operators L and Q is 
a markedly distinct problem: the lensing operator de- 
scribes the distortion to which the source surface bright- 
ness distribution is subjected, while the dynamics opera- 
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Fig. 3. — In the first row we display the weighted distribution function 7 (see text) sampled in the integral space (E,L Z ), together 
with the set of mock observables generated by this choice (with non-uniform Gaussian noise added): the surface brightness distribution 
S, the line-of-sight stellar streaming velocity {v z /) and the line-of-sight velocity dispersion a 2 . The second row shows the corresponding 
reconstructed quantities. The residuals are given in the last row. 



tor is effectively a library which stores the projected ob- 
servables (surface brightness distribution and unweighted 
velocity and velocity dispersion maps) associated with 
the set of elementary (i.e. Dirac 8) two-integral distribu- 
tion functions. 

5.1. The "dynamics operator" by means of a 
two-integral axisymmetric Schwarzs child method 

In this Section we describe how to construct the dy- 
namics operator Q (which is the most complex part 
of implementing Eq. |18| explicitly), introducing a new 
and extremely fast Monte Carlo implementation of the 
two-inte gral components Schw arzs child method, as pro- 
posed bv lCretton et al.l (|1999f ) and lVerolme fc de Zeeuwi 
(|2002ft . The Schwarzschild method is a powerful and 
flexible numerical method to construct numerical galaxy 
models without having to make any prior assumptions re- 
garding the shape, the symmetry, the anisotropy of the 
system, etc. 

In its original implementation ('Schwarzschildl |1979[ ) , 
the procedure works as follows. An arbitrary density dis- 
tribution (possibly motivated by observations) is chosen 



for the galaxy and the corresponding potential is com- 
puted by means of the Poisson equation. One then cal- 
culates a library of stellar orbits within this potential 
and finds the specific superposition of orbits which re- 
produces the initial density distribution. The method 
can be generalized to also treat cases in which den- 
sity and potential are not a self-consis tent pair and to 
include kinematic constraints (see e. g. iRichstonel Il980. 
[T^IPfenni^[l9M[Rrx et al.lll997fl . 

Orbits, however, are not the only possibility for the 
building blocks of a Schwarzschild method if one only 
aims to construct two-integral axisymmetric models for 
galaxies. Given an axisymmetric potential <&(-/?, z), one 
can also consider more abstract co nstituents called two- 
integral components or T ICs (see iCretton et al.l 119991 : 
IVerolme fc de Zeeuwll2002t) which correspond to elemen- 
tary Dirac 5 distribution functions completely specified 
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by a choice of energy Ej and angular momentum L 



fj(Ej,L Zj j) 



where Cs = Ci 



-<$(£ - £ 3 )<5(L 2 - inside ZVC 

elsewhere 

(19) 

] is a normalization coefficient chosen 
such that all the TICs have equal mass (see Appendix |D] 
for an explicit expression for Cj). The zero- velocity curve 
(ZVC) is the curve in the meridional plane (R, z) for 
which 

E kin = V(R, z)-^2- E = Q, (20) 

where E is the relative energy and V(R, z) = — z) is 
the relative potential; another frequently useful quantity 
is the effective potential V e s (R, z) = V(R, z) - L 2 /2R 2 . 

A TIC-based Schwarzschild method has two main ad- 
vantages. First, the j-th TIC can be interpreted as a 
particular co mbination of all orb its (both regular and 
irregular, see ICretton et al.1 11999( ) with energy Ej and 
angular momentum L ZJ which can be integrated in the 
potential $(i?, z) and completely fill the ZVC. Therefore, 
the TICs constitute a family of building blocks smoother 
than the regular orbits (which may have sharp edges) and 
automatically take into account the irregular orbits. Sec- 
ond, the unprojected density and velocity moments for 
the TICs have simple analytic expressions, which makes 
this method much faster than the ordinary orbit integra- 
tion. 

From the definition from equation p9[) . the density 
distribution in the mer idional plane generated by the j- 
th TIC is given by fsee lBinnev & Tremainelll987h 



pj(R,z) 



R 



inside ZVC 



(21) 



elsewhere, 



while the only non-zero velocity moments have the fol- 
lowing expressions inside the ZVC: 



Pj( v R)j = Pj( v z)j 



I \ - nC i t 



R 3 *,V 

^[V eS {R,z) 



E< 



(22) 
(23) 



(24) 



and they vanish elsewhere. 

Note that we cannot directly compare the quantities 
described by Eqs. (|2l"j) - (f24"]) with the observations. Be- 
fore this can be done, we need to calculate the projected 
quantities, grid them, and convolve them with the appro- 
priate PSF (and possibly regrid them again in the case 
of subgridding; see text) . 

The surface brightness Ej (sampled on a grid of 
Nx elements) and the weighted line-of-sight veloc- 
ity moments Tij(v z >)j and ^j(v 2 ,)j (both sampled on 
a grid of N/ v ,\ elements) can be obtained semi- 
an alytically through multi-d imensional integrals (refer 
to lVerolme fc de Zeeuwil2002t ). Because the same semi- 
analytic approach to calculate Q is rather time consum- 
ing, we have developed a very efficient Monte Carlo im- 
plementation of the two-integral Schwarzschild method, 



detailed in Appendix [C] which is several orders of mag- 
nitude faster. 

Whatever technique is adopted, the projected and 
PSF-convoluted quantities for the j-th TIC, sampled on 
the appropriate grids, constitute the j-th column of the 
Q matrix. Therefore, if the galaxy model is built using 
a library of N 7 = Ne x Nl, TIC building blocks (where 
Ne and Nl z are the number of samplings, respectively, 
in energy and angular momentum), the dynamics oper- 
ator Q turns out to be a dense matrix of dimensions 
A 7 x (Nv+2N M ). 

The meaning of the vector 7 in Eq. (fT6|) is now clear: 
its iV 7 dimensionless non negative elements jj describe 
the weights of the linear superposition of the model ob- 
servables generated by the library of TICs, and we look 
for the solution 7mp (given by Eq. [T5]) which best re- 
produces the data set p. Moreover, as explained in 
Appendix [Dl the weights jj are proportional to the 
light contributed by the TIC to the galaxy and related 
to the reconstructed dimensional distribution function 
DF(Ej , L z ,j) when they are normalized with the TIC 
area Azvc,j in the meridional plane and the surface of 
the cell A\e-,l x ■] m integral space. 

6. DEMONSTRATION OF THE METHOD 

In this Section we describe several of the tests that 
we have performed to show the proper functioning of the 
method. We construct a simulated data set (including re- 
alistic noise) for both lensing and dynamic s by adopting 
the potential given in Eq. ([!§ (lEvanslll994h and a partic- 
ular choice for the non linear parameters if (see Sect. 16.11 
for a description of the setup). Then in Sect. 16.21 keeping 
the non linear parameters fixed, we test the linear opti- 
mization part of the method by showing that the code is 
able to faithfully reconstruct the source surface bright- 
ness distribution and the distribution function used to 
generate the mock data. Finally, in Sect. 16.31 we execute 
a full run of the code. We use the mock set of simulated 
data as input and adopt a starting set of non linear pa- 
rameters considerably skewed from the "true" values, in 
order to verify how reliably the linear and non linear pa- 
rameters are recovered. Note that, for conciseness, what 
we refer to as the "evidence" (£) is, rigorously speaking, 
the logarithm of the evidence as presented in Sect. [2] See 
Appendix [El for its precise definition. 

6.1. The test setup 

As mentioned, for testing purposes it is c onvenient to 
make use of the Evans' po wer-law potentials (|EvansHl9~94l : 
lEvans fc de Zeeuw|[l99l 



$(R,z) = — 



2\ 0/2 



R 2 + R 2 



p ? 0, (25) 



where $0 is the central potential, R s is a core radius and 
q is the axis ratio of the e quipotentials. For j3 = this 
becomes the iBinnevi ()1981l ) logarithmic potential. 

What makes this class of axisymmetric galaxy models 
suitable for the setup of a test case, is the overall sim- 
plicity of their properties. The power-law galaxy models 
are characterized by elementary two-integral distribution 
functions and provide fully analytical expressions for the 
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density distribution, associated with the potential via the 
Poisson equation, the second velocity moments and the 
deflection angle (see Appendix IF]) . Moreover, even the 
projected quantities (i.e. the projected surface density 
and the line-of-sight second velocity moment) are analyt- 
ical. The mean stellar streaming velocity (v v ) is assigned 
th rough a physically motiva ted prescription (see Sect. 2.3 
of lEvans fc de Zeeuwlll994l ) which leads to a simple odd 
part for the distribution function, which does not con- 
tribute to the density and the second velocity moment. 

With the choice of the potential from equation (|25|) for 
the lens galaxy, the elements of the ff vector of non linear 
para met ers are f3, q, R s and <E>o (or equivalently, through 
Eq. |B4j . the lens strength an). In addition, ff includes 
the parameters that determine the observed geometry of 
the system: the inclination i, the position angle "dp a, and 
the coordinates £n of the center of lens galaxy on the sky 
grid. We note that ff can also include the external shear 
strength and position angle, although in our current tests 
we assume negligible shear for the sake of simplicity and 
to fully concentrate on the parameters of the lens galaxy 
only, in order to test whether their degeneracies can be 
broken by the combination of lensing and stellar kine- 
matic data. We also need to make a choice for the size 
of the grids in pixels. This actually constitutes an ex- 
plicit prior choice (just like the type of regularization) . 
The evidence however can be used to test exactly these 
types of assumptions. Explicitly, our test setup is the 
following. 

• Lensing: For the test setup, we adopt a 40 x 40 
pixel grid (N s — 1600) in the source plane and a 
100 x 100 pixel grid (N d = 10000) in the image 
plane. The lensing operator is built using an over- 
sampling factor of 3 to improve the quality of the 
reconstruction. 

• Dynamics: In the two-integral phase space we con- 
sider a grid of Ne = 10 elements logarithmically 
sampled in R C {E) and Nl z = 5 elements lin- 
early sampled in angular momentum, i.e. a total 
of iV 7 = 100 TICs (note that the grid must be mir- 
rored for negative L z ). Each TIC is populated with 
-Wtic = 10 5 particles. This number of TICs (when 
compared to the grid size for the lensing source for 
example) has been verified to be a good compro- 
mise between the quality of the distribution func- 
tion reconstruction and the heavy computational 
power needed in the iterative scheme for the con- 
struction of many TICs. The surface brightness 
and the line-of-sight velocity moments are sam- 
pled on different grids 6 of, respectively, 50 x 50 and 
21 x 21 elements (Ns = 2500 and AT (tv) = 441). 
Analogous to the case of lensing, an oversampling 
factor of 3 is adopted in the construction of the 
operator Q. 

6.2. Demonstration of the linear reconstruction 

We select and fix the following arbitrary (albeit phys- 
ically plausible) set of values for the ff vector: /3 = 0.28, 

6 The observed surface brightness and line-of-sight velocity mo- 
ments often are obtained with different instruments, hence the need 
for different grids. 



q = 0.85, R c = 0.3", a = 4.05" (and the ratio 7 
D ds /D s is taken to be 0.75), i = 60°, ??pa = 0° and 

£o = (0.25", —0.25"). This makes it possible to construct 
the lensing operator 8 M and the dynamics operator Q 
(Sects. 0] and [5]), which are then used to generate a simu- 
lated data set. With a 3 Ghz machine, the construction 
of the sparse lensing matrix is a very fast process (less 
than 1 s). Constructing the dynamics operator is more 
time consuming, although requiring in the above case still 
only about 7 s, i.e. of the order of 100 ms per TIC (it 
should be noted that this is a very short time for build- 
ing a dynamical library and is a direct consequence of the 
Monte Carlo implementation described in Appendix [C]) 
In addition: 

• Lensing: We adopt an elliptical Gaussian source 
surface brightness distribution s S i m and using the 
Evans' potential from equation (|25|) (see Eq. |F1| 
for the analytic expression of the corresponding de- 
flection angle) we generate the blurred lensed im- 
age. From this we obtain the final mock image 
rfsim by adding a Gaussian noise distribution with 
a = 0.03 times the image peak value. This is illus- 
trated in the first column of Fig. [2] 

The reconstruction of the non- negative source s rec , 
from the simulated data c4im, is obtained by solv- 
ing the linear system of equations (|15p . with the 
adoption of a fiducial value for the regularization 
hyperparameter (logAi on = —1.0). Although the 
matrix M t Cl 1 M + Ai en H T H in Eq. (15]) is large 
(10000 x 1600), it is very sparse and therefore, using 
L-BFGS-B, it only takes < 1 s to find the solution. 
The result is shown in Fig. [2] (middle column) to- 
gether with the residuals (right column). 

• Dynamics: We generate the simulated data set p sun 
for a self-consistent dynamical system described by 
the distribution function of the Evans' power-law 
potential. We adopt the same parameters used 
for the above potential. This kind of assumption, 
in general, is not required by the method. How- 
ever, we adopt it here because it immediately and 
unambiguously allows us to check the correctness 
of the simulated data in comparison to the ana- 
lytic expressions. In this way, it is possible to ver- 
ify that the considered TIC library 7* s i m , although 
composed of only 100 elements, indeed represents 
a fair approximation of the true distribution func- 
tion. 

The first two panels in the first row of Fig. [3] show 
the TIC weights 7* s i m (i.e. the weighted distribution 
function) sampled over the two-integral space grid. 
The remaining panels display the simulated data: 
the surface brightness distribution, the line-of-sight 
streaming velocity and the line-of-sight velocity 
dispersion 9 As can be seen, non-uniform Gaussian 
noise has been added to the simulated data (its 

7 We indicate as D s , and D^ s the angular diameter distance 
of, respectively, observer-source, observer-lens and lens-source. 

8 To obtain M, which is the blurred lensing operator, it is nec- 
essary to have the blurring operator B, which we construct from 
the 7 x 7 pixel PSF model. 

9 As explained in Appendix [C] the calculated quantity is not 
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full characterization is taken into account in the 
covariance matrix C^ 1 ). The noise on the surface 
brightness is a few percent of the value of this 
quantity in the inner region, while, in order to 
make the test case more realistic, the noise on 
the velocity moments is increasing outwards and 
becomes quite severe in the external regions. 10 

The reconstruction of the non negative TIC weights 
7roc is given by the solution of Eq. (jT5J) (the chosen 
values for the hyperparameters are logA^ yn = 9.2 

and logA^ yn = 9.4). The reconstruction of the dy- 
namical model constitutes, together with the gener- 
ation of the TIC library, the most time consuming 
part of the algorithm, requiring almost 10 s. This 
is a consequence of the fact that the 3382 x 100 dy- 
namics operator Q, although much smaller than L, 
is a fully dense matrix. If the number of TICs is sig- 
nificantly increased, the time required for calculat- 
ing the term Q T C r) 1 Q in Eq. (fig)) and to solve that 
set of linear equations with the L-BFGS-B method 
increases very rapidly (as does the time needed 
to generate the TICs, although less steeply), and 
therefore the dynamical modeling is typically the 
bottleneck for the efficiency of the method. 

The results of the reconstruction are shown in the 
second row of Fig. [3] (whereas the third row shows 
the residuals). The reconstruction is generally very 
accurate. 

Having verified the soundness of the linear reconstruc- 
tion algorithms, the next step is to test how reliably the 
method is able to recover the "true" values of the non lin- 
ear parameters ff from the simulated data, through the 
maximization of the evidence penalty function £{ff). 

6.3. Non linear optimization 

We first run the linear reconstruction for the refer- 
ence model M.ief described in Sect. 16.21 optimized for 
the hyper-parameters, to determine the value of the to- 
tal evidence £tot,rcf = £icn,ref + £dyn,rcf (reported in the 
first column of Table [T]). Since this is by definition the 
"true" model, it is expected (provided that the priors, 
i.e. grids and form of regularization, are not changed) 
that every other model will have a smaller value for the 
evidence. 

Second, we construct a "wrong" starting model Mq = 
M(ffo) by significantly skewing the values, indicated in 
the second column of Table [TJ of the four non linear 

the line-of-sight velocity dispersion <rf os , but instead (i>^,), which 
is additive and can therefore be directly summed up when the TICs 
are generated and superposed. The line-of-sight velocity dispersion 
to compare with the observation, however, can be simply obtained 
in the end as: 

°to. = <tft>-<««'> a - ( 26 ) 

10 The noise on the velocity moments (as in the case of real 
data) depends on the signal-to-noise ratio of the surface brightness 
at a given point. Hence the signal-to-noise ratio on the velocity 
moments decreases outward, since the noise is constant but the 
signal is not. The noise level was set at a level generally consistent 
with both the HST imaging, Keck spectroscopy and VLT VIMOS- 
IFU data that is currently being obtained as part of the SLACS 
survey. 



TABLE 1 

Results for iterative search of the best model parameters 
via evidence maximization (see text). 







Af r cf 


Mo 


Alstop 


A1 flna i 




i [deg] 


60.0 


25.0 


75.6 


62.6 


non linear 




4.05 


5.60 


3.86 


3.99 


parameters 


P 


0.280 


0.390 


0.288 


0.285 




q 


0.850 


0.660 


0.880 


0.857 




log A len 


-1.04 


0.00 


-1.04 


-1.04 


hyper- 




8.00 


12.0 


7.99 


8.00 


parameters 


log xT 


9.39 


12.0 


9.55 


9.39 






-22663 


-58151 


-22668 


-22661 


evidence 


^■dyn 


12738 


-156948 


12672 


12856 




£tot 


-9925 


-215099 


-9996 


-9805 



The reference model in the second column is the "true" model 
which generated the simulated data sets, and is shown for compar- 
ison. The first group of rows list the non linear parameters ff which 
are varied in the loop. The second group shows the hyperparame- 
ters. The last group shows the evidence relative to the considered 
model; the contributions of the lensing and dynamics part to the 
total evidence are also indicated. 

parameters: the inclination i, the lens strength ao, the 
slope (3 and the axis ratio q. Figures H] and [5] show that 
Mo is clearly not a good model for the data in hand. We 
do not set boundaries on the values that the non linear 
parameters can assume during the iterative search, ex- 
cept for those which come from physical or geometrical 
considerations (i.e. inclination comprised between edge- 
on and face-on, ao > 0, < q < 1, Q < < 1). The 
position angle, in general, is reasonably well-constrained 
observationally 11 , and therefore including it with tight 
constraints on the interval of admitted values would only 
slow down the non linear optimization process without 
having a relevant impact on the overall analysis. It is 
therefore kept fixed during the optimization. We also 
keep the core radius of the mass distribution fixed even 
though in general it is not well constrained by the sur- 
face brightness distribution and in real applications it 
should be varied. It always remains possible, once the 
best model has been found, to optimize for the remain- 
ing parameters in case a second-order tuning is required. 

Adopting ffo as the starting point for the exploration, 
the non linear optimization routine for the evidence max- 
imization is run as described in Sect. 12.21 The model 
A^ s tep (third column of Table[T]) is what is recovered after 
three major loops (the first and the last one for the opti- 
mization of the varying non linear parameters, the inter- 
mediate one for the hyperparameters) of the preliminary 
Downhill-Simplex optimization, for a total of ~ 1000 it- 
erations, requiring about 1.2 x 10 4 s on a 3 Ghz machine. 
From this intermediate stage, the final model Alfmai 
is obtained through the combined MCMC+Downhill- 
Simplex optimization routine, in a time of the order of 
12 — 15 hours. 

Further testing has shown that increasing the number 
of loops does not produce relevant changes in the deter- 
mination of the non linear parameters nor of the hyper- 
parameters, and therefore extra loops are in general not 

11 As a consequence of the assumption of axisymmetry, the posi- 
tion angle of the total potential should match by construction the 
position angle of the surface brightness distribution. 
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Fig. 4. — Outcome of the non linear optimization routine for the 
lensing block. The first row shows the simulated source and lensed 
image. In the second row is the starting model Mo obtained with 
the skewed parameters, which clearly produces a poor reconstruc- 
tion of the image (hence the very low value of the evidence) . The 
third row displays the final model -Mfl na i, that is the "best model" 
recovered through the non linear optimization for evidence maxi- 
mization. This should be compared to the reference model M Te { 
(i.e. generated with same if set of parameters which were used to 
generate the mock data), presented in the second column of Fig. [2] 
The last row shows the residuals of the final model when compared 
to the simulated data. Refer to Table [T] for the non linear param- 
eters and hyperparameters of the models and the corresponding 
value of the evidence. 

necessary. We also note that, in all the considered cases, 
the first loop is the crucial one for the determination of 
rf, and the one which requires the largest number of it- 
erations before it converges. It is generally convenient 
to set the initial hyperparameters to high values so that 
the system is slightly over-regularized. This has the ef- 



fect of smoothing the evidence surface in the 77-space, 
and, therefore, facilitates the search for the maximum. 
The successive loops will then take care of tuning down 
the regularization parameters to more plausible values. 

A comparison of the retrieved non linear parameters 
for M. fi na i (last column of Table [1]) with the correspond- 
ing ones for the reference model reveals that all of them 
are very reliably recovered within a few percent (the most 
skewed one is the inclination i, being only ~ 4% differ- 
ent from the "true" value) . The panels in the second and 
fourth rows of Figures [4] and [5] clearly show that the two 
models indeed look extremely similar, to the point that 
they hardly exhibit any difference when visually exam- 
ined. 

The evidence of the final model, when compared with 
the value for the reference model, turns out to be higher 
by AS = £ fi na i — £ re f = 120, which might look surpris- 
ing since A4 le ( is by construction the "true" model. The 
explanation for this is given by the cumulative effects 
of numerical error (which enters mainly in the way the 
TICs are generated 12 ), finite grid size, and noise (in par- 
ticular on the velocity moments), which all contribute to 
slightly shift the model. Such a shift is, however, well 
within the model uncertainties (Sect. 16. 4[) and therefore 
not significant. 

6.3.1. Degeneracies and Biases 

Extensive MCMC exploration reveals that the lensing 
evidence surface £i on (?7) is characterized by multiple lo- 
cal maxima which effectively are degenerate even for very 
different values of ff. Indeed, for the potential Eq. (f2"S")) . 
one can easily observe from Eq. (|F1|) that a relation be- 
tween q, i, and <&o exists that leaves the deflection angles 
and, hence, the lens observable invariant. 

In this situation, the dynamics plays a crucial role 
in breaking the degeneracies and reliably recovering the 
best values for the non linear parameters. A clear exam- 
ple is shown in Figure [51 where cuts along the evidence 
surfaces are presented. The left panel of Figure[6]displays 
two almost degenerate maxima in fi en ; indeed, the maxi- 
mum located at the linear coordinate e = (correspond- 
ing to the set of non linear parameters i = 35°, a = 5.59, 
(3 = 0.297, q = 0.602) has a value for the lensing evidence 
of —22661, and therefore, judging on lensing alone, this 
model would be preferred to the reference model (see Ta- 
ble [T]) which corresponds to the other maximum at e = 1. 
When the evidence of the dynamics (Fig. [6] central panel) 
is considered, however, the degeneracy is broken and the 
"false" maximum is heavily penalized with respect to the 
true one, as shown by the resulting total evidence surface 
(Fig. [6] right panel). 

Test cases with a denser sampling of the integral space 
(e.g. N E x N Lz = 12 x 8, 18 x 9, 20 x 10) were also consid- 
ered. This analysis revealed that, although one obtains 
more detailed information about the reconstructed two- 
integral distribution function at the cost of a longer com- 
putational time (e.g. in the 18 x 9 case the loop phase 



12 In fact, when in this same case the TICs are ; 
10 times the number of particles, i.e. JVpic = 10 b 



populated with 
the evidence 

increases for both A4 re { and .Mfl na i, and the reference model is 
now favored (of a A£ = £ rc { — £fl na i ~ 30) as expected. Obviously, 
this comes at a significant cost in computational time, with the dy- 
namical modeling becoming approximately an order of magnitude 
slower. 
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Fig. 5. — Outcome of the non linear optimization routine for the dynamics block. The first row shows the simulated set of TIC weights 7 
(i.e. the weighted distribution function) in the integral space and the resulting observables: the surface brightness E, the line-of-sight 
streaming velocity (v z i) and the line-of-sight velocity dispersion <rf . The second row presents the reference model A4 Ie f (i.e. generated 
with same ff set of parameters which were used to generate the mock data). In the second row is the starting model Mo obtained with the 
skewed parameters, which produces a poor reconstruction of the observables, in particular the velocity moments. The third row displays 
the final model .Mfinab that is the "best model" recovered through the non linear optimization for evidence maximization. This should 
be compared to the reference model M IC { (i.e. generated with same if set of parameters which were used to generate the mock data), 
presented in the second row of Fig. \E\ The last row shows the residuals of the final model when compared to the simulated data. Refer to 
Table [T] for the non linear parameters and hyperparameters of the models and the corresponding value of the evidence. 



is slower by more than a factor of 2), the accuracy of 
the recovered non linear parameters - which is what we 
are primarily interested in - does not (significantly) im- 
prove. As a consequence, the iV 7 = 100 case is assumed 
to be the standard grid for the integral space as far as 
the dynamical modeling is concerned to give an unbiased 
solution. Once the non linear parameters have been re- 



covered through the evidence maximization routine, it 
is possible to start from the obtained results to make 
a more detailed and expensive study of the distribution 
function using a higher number of TICs. 

6.4. Model uncertainties 

To determine the scatter in the recovered parameters ff 
we performed the full non linear reconstruction on a set 
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TABLE 2 

Summary of the results of the non linear reconstruction 
algorithm applied on a set of 100 random realizations of 
the test data. 





median 


mean 


95% 


C.I. 


-Mrof 


i [deg] 


63.0 


63.3 


[59.5, 
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60.0 


non linear ao 


4.02 


4.02 


[3.94, 
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4.05 


parameters 


0.276 
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[0.266, 


0.293] 


0.280 


q 


0.861 


0.861 


[0.849, 


0.873] 


0.850 



Fig. 6. — These plots show how the degeneracy between different lens models is broken when the constraints given by the stellar dynamics 
are also considered. The three panels display a cut through the surfaces of, respectively, lensing evidence, dynamics evidence and total 
evidence. The abscissa coordinate e represents a series of different if sets, i.e. different models, obtained as a linear interpolation between 
the model parameters fj c= o = (i = 35°, a = 5.59, /3 = 0.297, q = 0.602), i.e. the best model which is obtained by maximizing for lensing 
evidence only, and the model rf £= i which is the reference model -M re f of Table [T] As far as only the gravitational lensing is considered, 
the two maxima in the evidence are effectively degenerate, with the "wrong" model rf e= o being slightly preferred (A£i cn = 2). When the 
contribution of the evidence of dynamics is considered, however, the degeneracy is broken and the reference model emerges as indisputably 
favored by the total evidence (A£tot — 35000). 

of 100 random realizations of the test data. The statisti- 
cal analysis of the results is summarized in Table [5J On 
examining the 95% confidence intervals, the parameters 
appear to be quite tightly constrained, with the partial 
exception of the inclination angle which spans an interval 
of approximately 10°. The means (and the medians) of 
the parameters a, (3 and q are very close to the "true val- 
ues" of the reference model, while the mean of parameter 
i is slightly more skewed. The two sets of parameters (i, 
q) and (a, (3) are clearly significantly correlated when 
plotted against each other (see Fig. [7] for a graphical rep- 
resentation of the correlation matrix) . Both correlations 
can be understood. The former is due to the fact that 
the projected lens potential is nearly invariant when the 
inclination is increased and flattening is decreased simul- 
taneously; whereas, in the latter case the requirement 
that the lens mass enclosed by the lensed images be very 
similar between lens models causes the lens strength and 
density slopes to vary in concordance. 

Although the starting parameters of model Aio are 
very different from the "true" solution given by ./Vf r ef, 
in all the considered cases the recovered parameters end 
up close to the A4 T e{ ones, and there is no case in which 
the solution remains anchored in a really far-off local 
minimum. Hence, even in the case of our chosen lens 
potential, which has global degeneracies in the lensing 
observables (see Section [6.3.1[) . these degeneacies are bro- 
ken through the inclusion of stellar kinematic data. This 
indeed shows that the combination of lensing and stellar 
kinematic information is a very promising tool to break 
degeneracies of the galaxy mass models. 



7. CONCLUSIONS AND FUTURE WORK 

We have presented and implemented a complete frame- 
work to perform a detailed analysis of the gravitational 
potential of (elliptical) lens galaxies by combining, for 
the first time, in a fully self-consistent way both gravita- 
tional lensing and stellar dynamics information. 

This method, embedded in a Bayesian statistical 
framework, enables one to break to a large extent the 
well-known degeneracies that constitute a severe hin- 



drance to the study of the early-type galaxies (in par- 
ticular those at large distances, i.e. z > 0.1) when the 
two diagnostic tools of gravitational lensing and stellar 
dynamics are used separately. By overcoming these dif- 
ficulties, the presented methodology provides a new in- 
strument to tackle the major astrophysical issue of un- 
derstanding the formation and evolution of early-type 
galaxies. 

The framework is very general in its scope and in prin- 
ciple can accommodate an arbitrary (e.g. triaxial) poten- 
tial $(x). In fact, if a combined set of lensing data (i.e. 
surface brightness distribution of the lensed images) and 
kinematic data (i.e. surface brightness distribution and 
velocity moments maps of the galaxy) is provided for an 
elliptical lens galaxy, it is always possible, making use 
of the same potential, to formulate the two problems of 
lensed-image reconstruction and dynamical modeling as 
sets of coupled linear equations to which the linear and 
non linear optimization techniques described in Sect. [3] 
and [5] can be directly applied. 

More specifically, in the case of gravitational lensing 
the non-parametric source reconstruction method (as il- 
lustrated in Sect. 2]) straightforwardly applies to the gen- 
eral case of any potential In the case of dynamical 
modeling, a full Schwarzschild method with orbital inte- 
gration would be required; this would constitute, how- 
ever, only a mere technical complication which does not 
modify the overall conceptual structure of the method. 
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Fig. 7. — Graphical visualization of the lower triangle of the (symmetric) correlation matrix for the parameters recovered from the non 
linear reconstruction of 100 random realizations of the test data. In the shown panels the four non linear parameters i, a, f3 and q are 
plotted two by two against each other; for each panel, the corresponding value pij of the correlation matrix is also indicated. 



7.1. The CAULDRON algorithm 

In practical applications, however, technical difficul- 
ties and computational constraints must also be taken 
into account. This has motivated the development, from 
the general framework, of the specific working implemen- 
tation described in this paper, which restricts itself to 
axisymmetric potentials Q(R, z) and two-integral stellar 
phase-space distribution functions. This choice is an ex- 



cellent compromise between efficiency and generality. 13 
On one hand it allows one to study models which go far 
beyond the simple spherical Jeans-modeling case, and 
on the other hand, it has the invaluable advantage of 

13 This is particularly true also in consideration of the currently 
available data quality for distant early-type galaxies, for which the 
information about the projected velocity moments is in general 
not very detailed, and could not reliably constrain a sophisticated 
dynamical model. 
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permitting a dynamical modeling by means of the two- 
integr als Schwarzschild method of IVerolme fc de Zeeuwl 
(2002). This method (see Sect. [5J is based on the su- 
perposition of elementary building blocks (i.e. the TICs) 
directly obtained from the two- integral distribution func- 
tion, which do not require computationally expensive or- 
bit integrations. 

More specifically, we have sped up this method by sev- 
eral orders of magnitude by designing a fully novel Monte 
Carlo implementation (see Appendix [Cj) . Hence, we are 
now able to construct a realistic two-integral dynamical 
model and its observables (surface brightness and line-of- 
sight projected velocity moments) in a time of the order 
of 5 — 15 s on a typical 3 Ghz machine. 

The Bayesian approach (e.g. Sect. [2]) constitutes a fun- 
damental aspect of the whole framework. On a first level, 
the maximization of the posterior probability, by means 
of linear optimization, provides the most probable solu- 
tion for a given data set and an assigned model (which 
will be, in general, a function of some non linear parame- 
ters if, e.g. the potential parameters, the inclination, the 
position angle) in a fast and efficient way. A solution (i.e. 
the source surface brightness distribution for lensing and 
the distribution function for dynamics), however, can be 
obtained for any assigned model. The really important 
and challenging issue is instead the model comparison, 
that is, to objectively determine which is the "best" 
model for the given data set or, in other words, which 
is the "best" set of non linear parameters if. Bayesian 
statistics provides the tool to answer these questions in 
the form of a merit function, the "evidence" , which nat- 
urally and automatically embodies the principle of Oc- 
cam's razor, penalizing not only mismatching models but 
also models which correctl y predict the data bu t are un- 
neces sarily complex (e.g. iMacKavl I1992L Il999t iMackavl 
2003). The problem of model comparison thus becomes 
a non linear optimization problem (i.e. maximizing the 
evidence), for which several techniques are available (see 
Sect. [mi). 

As reported in Sect. El we have conducted successful 
tests of the method, demonstrating that both the lin- 
ear reconstruction and the non linear optimization algo- 
rithms work reliably. It has been shown that it is pos- 
sible to recover within a few percent the values of the 
non linear parameters of the reference model (i.e. the 
"true" model used to generate the simulated data set), 
even when starting the reconstruction from a very skewed 
and implausible (in terms of the evidence value) initial 
guess for the non linear parameters. Such an accurate 
reconstruction is a direct consequence of having taken 
into account, beyond the information coming from grav- 
itational lensing, the constraints from stellar dynamics. 
Indeed, when the algorithm is run considering only the 
lensing data, degenerate solutions with comparable or 
almost coincident values for the evidence are found, 1 
making it effectively impossible to distinguish between 
these models. The crucial importance of the informa- 
tion from dynamics is exhibited by the fact that, when 
it is included in the analysis, the degeneracies are fully 

14 Because of the presence of noise in the data, numerical errors 
and model inaccuracies, solutions for the non linear parameters 
which differ from the parameters of the reference models can be 
slightly favored by the evidence. 



broken and a solution very close to the true one is un- 
ambiguously recovered (see Fig. [6] for an example). 

Bearing in mind these limitations and their conse- 
quences, however, it should also be noted that the full 
modularity of the presented algorithm makes it fit to be 
used also in those situations in which either the lensing or 
the kinematic observables are not available. This would 
allow one, for example, to restrict the plausible models to 
a very small subset of the full space of non linear param- 
eters, although a single non-degenerate "best solution" 
would probably be hard or impossible to find. 

7.2. Future work 

Eventhough the methodology that we introduced in 
this paper works very well and is quite flexible, we can 
foresee a number of improvements for the near and far 
future, in order of perceived complexity, (i) Exploration 
of the errors on the non linear parameters if through a 
MCMC method, even though this requires an extension 
of the MCMC framework in the context of the Bayesian 
evidence, since one also needs to explore variations in the 
lens parameters due to changes in the hyperparameters, 
and not only changes in the posterior for fixed hyper- 
parameters. (ii) Implementation of additional potential 
(or density) models or even a non-parametric or multi- 
pole expansion description of the gravitational potential 
in the (R, z)-plane for axisymmetric models. This allows 
more freedom for the galaxy potential description, (iii) 
Implementing an approximat e three-integral method i n 
axisymmetric potentials (e.g. lDehnen fe; Gerhardl fl993). 
(iv) Including an additional iterative loop around the 
posterior probability optimization, one can construct 
stellar phase-space distribution functions that are self- 
consistent, i.e. they generate the potential for which 
they are solutions to the collisionless Boltzmann equa- 
tion. This would allow the stellar and dark-matter po- 
tential contributions to be separated, a feature not yet 
part of the current code, (v) A full implementation of 
Schwarzschild's method for arbitrary potentials through 
orbital integration. 

Besides these technical improvements, which are all be- 
yond the scope of this methodological paper, we also 
plan, in a future publication, a set of additional per- 
formance tests to see to what level each of the degen- 
eracies (e.g. the mass sheet and mass anisotropy) in 
lensing and stellar dynamics are broken and how far 
the simpler lensing plus s pherical Jeans approac h (e.g. 
iTreu fc Koopmansl [200l IKoopmans et all l2006f ) gives 
(un)biased results. Such studies would allow us to bet- 
ter interpret the results obtained in cases where spatially 
resolved stellar kinematics is not available (e.g. for f aint 
very high redshift systems; IKoopmans fc Treij[2002ll . 

As for the application, the algorithm described in this 
paper will be applied in a full and rigorous analysis of the 
SLACS samp le of massive early-type l ens galaxies (see 
Bolton et all 120061 ; ITreu et all 120061 ; IKoopmans et all 
2006|) for which the available data include Hubble Space 
Telescope Advanced Camera for Survey and NICMOS 
images of the galaxy surface brightness distribution and 
lensed image structure and maps of the first and sec- 
ond line-of-sight projected velocity moments (obtained 
with VLT-VIMOS two-dimensional integral field spec- 
troscopy, as part of the Large Program and as a series of 
Keck long-slit spectra). 
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APPENDIX 
A. REGULARIZATION 

We make use of a curvature regularization. This form of reg ularization tries to put the second derivative between a 
pixel and the two adjacent ones to zero and has been shown bv lSuvu et al. (2006) to be optimal for the reconstruction 
of smooth distributions. The curvature regularization has been chosen since we do not expect, in the majority of 
cases, to have sharp intensity variations in the surface brightness distribution of an extended source (for lensing) or in 
the distribution function of a galaxy. Howeve r, other choices of regularization can easily be implemented and ranked 
according to their evidence (|Suvu et al.1 12006). 

Following the notation of Sect. [2J we indicate as x the linear parameter vector (or, more simply, the source) and 
as H the regularization matrix. Since the source is defined on a rectangular grid of N = N mw x N co \ elements, H 
actually consists of two matrices, H row and H co i, which regularize the grid pixels along the rows and the columns, 
respectively. The horizontal regularization operator H row is a square matrix of rank N. In each row i the only non 
zero elements are = +X ) hii = — 2, /ij^+i = +1; the only exceptions are the rows 1 + kN co \ and iV co i + kN co \ 

(with k = 0, 1, . . . , N row — 1), where a zeroth-order regularization is performed (i.e. hij = 1 is the only non zero term) 
to prevent the connection of pixels belonging to different rows which are therefore physically uncorrelated. Similarly, 
the vertical regularization operator H co i is constructed such that in each row i all the elements are zero with the 
exclusion of jv^ = +X,hi t i = — 2, hi^ + N col = +1; a zeroth-order regularization is applied for the rows 1 . . . N co \ 
and N - N coi + X.T.N. 

B. NORMALIZATION: SETTING THE SCALE OF LENSING 

For the assigned three-dimen sional gravitational potential $, the reduced deflection angle a is given by (e.g. 
iSchneider. Ehlers. fe Falcol[l992h 

c 1 D s J_ oc « 

where z' is the line-of-sight coordinate, (x',y') = £ are the sky coordinates and V|- denotes the two-dimensional 

1/2 

gradient operator in the plane of the sky; c is the speed of light expressed in the same units as the value of | <& | ' . 
If the gradient operator, which does not depend on z', is taken out of the integral and the potential is conveniently 
written as 

$(i;« / ) = $ox$(|;« / ) 1 (B2) 

where $o is the normalization constant in the most suitable physical unit (in our case km s _1 ) and $ is a function of the 
dimensionless coordinates expressed as angles in arcseconds (£ = £ £/D&, z' = 648 J )0 ° z ' / D<±), then the deflection 

" + 00 



angle assumes the expression 

J — CO 



a(x ,y) = a V-- 



(B3) 



where 



6.48 x 10 5 2$ D ds 

a ^^^~ -dT (B4) 



is the lens strength in arcseconds. The parameter ao sets the scale for the lensing and, therefore, is always included in 
the parameter vector ff. Eq. (|B4j) openly displays how intimately the lens strength is connected to the normalization 
of the three-dimensional potential (the same used for the dynamical modeling) within our joint method. 

C. A MCMC IMPLEMENTATION OF THE TWO-INTEGRAL AXISYMMETRIC SCHWARZSCHILD METHOD 

In this Appendix we describ e the numerical implementation of the two-integral axisymmetric Schwarzschild method 
of IVerolme fc de Zeeuwl (|2002f) that we developed to significantly accelerate the construction of the dynamical model, 
i.e. the projected and PSF- convoluted model observables generated by the TIC library. 

As a first step, we construct a library composed of = Ne x Nl^ TICs in the given potential <& (here Nl, is an 
even number). We consider a grid (linear or logarithmic) in the circular radius R c with Ne samplings between i? c ,min 
and -R c ,max- The range is chosen to include most of the mass or, if the mass is infinite for the potential <3>, to provide a 
satisfactory sampling of the potential profile in the radial direction. For each R c the circular speed v c is calculated as 



v 2 c {R c ) = R c || 



(CI) 

(flc,0) 



and the angular momentum of the circular orbit £ Zima x = R c v c is set. Computing the energy E c = E(R C ) = V e g(R c , 0) 
of the circular orbit at R c , the radial grid is immediately translated into a sampling in energy. For each value of E c , 
the grid in the normalized angular momentum -q = L z / X Zimax is constructed by sampling (linearly or logarithmically) 
NlJ2 values between the minimum 77 m ; n = and the maximum ?7 max = 1. (For numerical reasons, the grid is actually 
not sampled between these extrema, but between 7y m ; n = e and ?7 max = 1 — e, with e <^ 1). To take the odd part of the 
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distribution function into account as well, we likewise consider the Nl z /2 negative values for the angular momentum, 
77 = —1 . . .77 = 0, on a mirror grid. 

We also need to define a suitable coordinate frame for the galaxy. Since the system is axisymmetric, we adopt the 
cylindrical coordinates (R, tp, z), with the origin on the center of the galaxy. If the galaxy is observed at an inclination i 
and $pa is the position angle (defined as the angle measured counterclockwise between the north direction and the 
projected major axis of the galaxy), the projected coordinates (x',y',z') are given by 

x' = i?(cos $pa sin tp + sin z?pa cos i cos tp) — z sin -#pa sin i 

y' = i?(sin i3pa sin tp — cos d-pA cos i cos (p) + z cos $pa sin i (C2) 
z' = R sin i cos tp + z cos i. 

Here z' is measured along the line of sight, while x' and y 1 are in the plane of sky and are directed (respectively) along 
the projected major and minor axes of the galaxy. 

For any TIC, we populate the surface inside the zero velocity curve with Vtic particles of mass (or equivalcntly 
luminosity) m by means of a Markov-Chain Monte Carlo routine whose probability distribution is given by Eq. (|21|) 
for the density. This effectively corresponds to numerically reproducing the density p(R, z), fixing at the same time the 
total mass Mtic = mNuc for each TIC. However, since p(R,z) cx 1/R, the surface density of the torus "wrapped" 
onto the meridional plane (denoted as <r) is constant, i.e., 

.2* { 2Tt 2 C,j inside ZVC 

Sj(R,z)= / pj(R,z)Rdtp = 
Jo 



(C3) 

elsewhere. 



One can take advantage of this property to greatly simplify the Markov-Chain Monte Carlo routine. Now it is only 
necessary to uniformly populate the meridional plane. For each particle a pair of coordinates (R, z) (within some 
interval which encompasses the ZVC) is randomly generated. If it falls outside the ZVC, the particle is "rejected" and 
another one is generated. If the coordinates are located inside the ZVC, a random value in the interval [0, 2tt) is chosen 
for the azimuthal coordinate in order to have a complete tern (R, tp, z), and the particle counts toward the total of Vtic 
drawings. This procedure yields at the same time the surface Azycj enclosed by the ZVC in the meridional plane 
(effectively obtained via Monte Carlo integration), which is required for the normalization of the jj (see Appendix |D|) . 

With this method the computation of all the projected quantities is fast and straightforward. For each "accepted" 
mass point we know the cylindrical coordinates (R,tp,z); associated with it are also the velocity moments defined 
by Eqs. ([22]) - ([24 ]) . Using the first two equations of the transformation from equation (IC2j) . the projected coordinates 
(x', y') are directly calculated. Casting the points on a grid on the sky plane and summing up all the points in the same 
pixel then reproduces numerically the projected surface brightness distribution Ej (see Figure [8] for an illustration). 
The line-of-sight velocity moments associated with each point in the sky plane (but possibly on a different grid) are 
obtained in an analogous way (but making use now of the third equation of |C2j ) from the corresponding unprojected 
quantities 

(v z >) — — (v v ) sini simp, (C4) 

(v 2 z ,) = ((«!) cos 2 tp + (v 2 ) sin 2 tp) sin 2 i + (v 2 R ) cos 2 i. (C5) 

In analogy with the surface brightness, the first and second line-of-sight moments associated with each mass point 
inside a given pixel are summed up. This gives the quantities and E(v 2 ,). 

The effect of the PSF is taken into account by simply convolving the projected surface brightness or weighted velocity 
moments calculated on their respective grids (preferably oversampled) with the PSF profile sampled on the same grid. 
This operation can be numerically performed in a very efficient way through several FFTs (fast Fourier transforms). 

This numerical implementation is dramatically faster than the semi analytic approach (at the expense of some 
numerical noise). On a machine with a 3 Ghz processor, the whole process of calculating the projected quantities E, 
(v z r), and (w 2 /) convolved with the PSF for 1400 TIC s takes about 3 minute s (with Vtic = 10 5 ). This figure should 
be compared with the ~ 30 minutes required bv lVerolme fc de Zeeuwl ((2002) to calculate (on a 1 GHz machine) only 
the projected density without PSF convolution for an equal number of TICs. 

D. NORMALIZATION OF THE TICS WEIGHTS 

In this Appendix we illustrate how the reconstructed adimensional weights jj are translated into the dimensional 
distr ibution function values DF (£',, L z j). For a two-integral distribution function, the density is given by the formula 
(e.g. lBinnev fc Tremainelll987ft 

z) = 2nRp(R, z) = 4tt 2 f dE [ VF{E,L z )dL z , (Dl) 

JO JLl<2{V-E)R 2 

where <;(R, z) is the surface density "wrapped" in the meridional plane (cf. definition |C3| ). If we assume that DF can 
be considered approximately constant over the cell dEjdL z j of area A[ Ej Lz .] in the integral space, and we remember 
the properties of the TICs, then we have 

47r 2 DF(£ i) Z^)AEidZ Zlj = z y (D2 ) 
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TIC[E=fl, Lz-3], i = 0, PA — 0, not convolved with PSF 



TIC[E=8, Lz=3], i = 0, PA — 0, convolved with PSF 




X [arcsec] 

TIC[E=8, Lz=3], i - 45, PA - 0, not convolved with PSF 



X [arcsec] 

TIC[E=8, Lz=3], i = 45, PA - 0, convolved with PSF 




X [arcsec] 



X [arcsec] 



Fig. 8. — As an illustration of the method described in the text, we show for a given TIC (obtained with Ntic = 5 x 10° particles) the 
density distribution projected, from top to bottom, at i = 0° (face-on), i = 45°, and i = 90° (edge-on). In the left column the density 



In the previous formula, ^ is the "wrapped" surface densi ty g enerated by TIC.,, specified by the pair (Ej,L z j), 
which is constant inside the ZVC and zero elsewhere (see Eq. |C3| V. 



27r 2 C, 



A/tic 
Azvc.j 



(inside the ZVC); 



(D3) 



here Azvc. j denotes the area enclosed by the ZVC in the meridional plane (which can be calculated as described in 
Appendix [Cj and Mtic = mTVxic is the fixed TIC mass (all the TICs have equal mass by construction). 
Combining Eqs. (|D2[) and (|D3[) we find the desired relation 



BF(Ej,L z 



7j 



miV T ic 



Air 2 A? 



zvc,jA[ Ej x z j 



(D4) 
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which translates the weights jj into distribution function values expressed in the standard physical units of mass 
length" 3 velocity -3 . If m, in the numerator of the right-hand side of Eq. (|D4[) . is omitted or divided by a mass-to-light 
ratio coefficient T, the resulting distribution function is expressed in terms of (respectively) number or luminosity 
phase-space density. 

E. THE EVIDENCE FORMULA 

In this Appendix we use the same notation as Sect. [2j and we indicate as 7V X and iVb, respectively, the number of 
elements in the linear parameter vector x and in the data vector b. If the assumptions made in Sect. [51 viz., Gaussian 
noise and quadratic functional form of the regula rization term E-iz (x) with minimum in x rcg = 0, are valid, then the 
logarithm of the evidence has the expression (e.g. ISuvu et~aT1l2006f ) 



logP(6|A,A,H)=-~(Af - bf C-^Ax - b) - ^||Hf|| 2 - - log [det (A T C -1 A + AH T H)] 

+ ^ log A + \ log [det (H T H)] - ^ log(27r) + I log (det CT 1 ) . 

The expression of the evidence in the case of lensing and dynamics is immediately obtained by rewriting Eq. (|E1 
with the notation of Sections 0] and [5] 



(El) 



F. THE DEFLECTION ANGLE FOR EVANS' POWER-LAW GALAXIES 

All the re leva nt quantities for the Evan s' power-law galaxy models which are used in Sect. [6] are analytic (refer to 
Evans 1993 and lEvans &: de Zeeuwl[T994l for the full expressions). The lensing deflection angle a can be calculated 
from the potential from Eq. (f2"5)) . resulting in: 



V 



20F£>ds r 



2 



c 2 Od r (^) 



{Rl + x' 2 + y' 2 /q' 2 ) 2 



(Fl) 



2 



D. 



d r 



y'/q' 



q (i? s 2 + z' 2 + y' 2 A?' 2 )^ 



where (x',y') are the coordinates in the sky plane (which is defined as the plane orthogonal to the line of sight z') 
q' = y cos 2 i + q 2 sin 2 i is the projected axis ratio, and T is the gamma function. 



